library(dplyr)
library(tidyr)
library(ggplot2)
library(stringr)
library(lme4)
library(rstatix)
options(scipen=999)
df = read.csv(here::here('avg_tensor_by_roi_wide.csv'),
colClasses = c('subject' = 'factor',
'site' = 'factor',
'visit' = 'factor')) %>%
rename(scan = visit)
outcomes = df %>%
select(where(is.numeric)) %>%
colnames
The purpose of the DTIMSCAMRAS study is to test for site effects on DTI metrics of different brain regions in the context of a traveling subjects design with a harmonized imaging protocol. Participants (n=11) with stable Multiple Sclerosis (MS) were scanned in 4 different locations: National Institute of Health (NIH), Brigham and Women’s Hospital (BWH), Johns Hopkins University (Hopkins), and The University of Pennsylvania (Penn). Two scans were collected per site visit, and participants were re-positioned between scans. The modalities that were collected include T1s (MPRAGE), FLAIR, and Diffusion-weighted images (DWI).
To preprocess imaging data, qsiprep was used for bias correction, skull-stripping and co-registration of the T1s and DWIs, and specialized denoising and artifact correction for the DWIs. Then, DTIFIT was used to compute the following scalar maps:
Concurrently, several segmentation pipelines were used on the T1s (+ FLAIRs in the case of MIMOSA) with the purpose of defining particular regions of interest (ROIs):
The segmentation results were used to average the scalar maps across all voxels belonging to each of the ROIs. The 36 resulting metrics make up the set of outcome variables which are the focus of the site effects test. Below, the outcome variables are listed.
data.frame(OUTCOME = outcomes) %>%
separate(OUTCOME, c("SEGMENTATION", "ROI", "SCALAR"), remove = FALSE)
Note the general
A permutation-based F-test was used to test for site effects. For each subject, the absolute difference between all possible pairs of intra-site and inter-site measures were computed. For instance, take the first row of the subject data.frame below, which has BWH as the “reference site”: the absolute differences between that row and the 6 non-BWH rows are computed for \(y\)——this process is repeated for all rows (and the results averaged) to arrive to the average inter-site difference for this subject; for the average intra-site differences the absolute difference of the 4 intrasite pairs are averaged.
df %>%
filter(subject == '01001') %>%
select(!where(is.numeric), ATROPOS_GM_AD) %>%
rename(y = ATROPOS_GM_AD)
These differences were averaged across all subjects resulting in the Mean Absolute Difference Between Sites (\(\overline{MAD_{B}}\)) and the Mean Absolute Difference Within Sites (\(\overline{MAD_{W}}\)). The test statistic is composed of their ratio:
\[ F = \frac{\overline{MAD_{B}}}{\overline{MAD_{W}}} \]
For each metric, the observed test statistic was compared to a distribution of test statistics derived from 10000 permutations (i.e. a null distribution). The proportion of null results higher than the observed test statistic served as a preliminary p-value \(p_0\). To prevent p-values of 0 (when the observed statistic was higher than all permuted results), the p-value was shifted using the following formula:
\[ p = \frac{10000*p_0 + 1}{10000 + 1}\]
As an additional test of site effects, mixed models were performed corresponding to a crossed design using the lmer function of the lme4 package. For each outcome variable, two mixed-effects models were fitted: The formula for the base model includes a random intercept for site as well as a random intercept for subject. The formula for the extended model also includes these terms in addition to a fixed effect term for site.
Base Model:
y ~ (1|subject) + (1|site)
Extended model:
y ~ site + (1|subject) + (1|site)
For each outcome, these two models were compared through a deviance test using the anova function in R. P-values for these tests are reported as a test of site effects.
As yet another test of site effects, a repeated-measures ANOVA was performed for each metric. Here, only data from scan 1 was used. The models were run using the anova_test function from the rstatix package. In these models, site was specified as a within-subject factor.
Subjects 03001 and 03002 did not complete imaging at all 4 sites. Subject 03001 completed imaging at Penn and BWH only and 03002 completed imaging at BWH, Hopkins, and the NIH only.
Subject 02001 is missing DTI data from their NIH visit.
t_df <- df %>%
count(subject, site, .drop = FALSE) %>%
tidyr::pivot_wider(names_from = site, values_from = n) %>%
mutate(`Row Totals` = BWH + Hopkins + NIH + Penn)
t_df <- t_df %>%
summarize(across(where(is.integer), sum)) %>%
bind_rows(t_df, .) %>%
mutate(across(where(is.factor), as.character))
t_df[is.na(t_df)] <- "Column Totals"
t_df
Atropos segmentation failed for 6 images: 04001NIH01, 01003NIH01, 03002NIH02, 04003NIH01, 04001BWH02, and 03001BWH01.
knitr::include_graphics(here::here(c('misc/atropos_success.png', 'misc/atropos_fail.png')))
Left: Example of successful Atropos segmentation; Right: Example of a failed Atropos segmentation
The following table excludes the failed atropos segmentations. Note subject 03001 only has 3 images after excluding failed atropos segmentations.
t_df <- df %>%
unite(sub, subject, site, scan, sep = '-') %>%
filter(!sub %in% c('04001-NIH-01', '01003-NIH-01', '03002-NIH-02', '04003-NIH-01', '04001-BWH-02', '03001-BWH-01')) %>%
separate(sub, c('subject', 'site', 'scan')) %>%
mutate_if(is.character, as.factor) %>%
count(subject, site, .drop = FALSE) %>%
tidyr::pivot_wider(names_from = site, values_from = n) %>%
mutate(`Row Totals` = BWH + Hopkins + NIH + Penn)
t_df <- t_df %>%
summarize(across(where(is.integer), sum)) %>%
bind_rows(t_df, .) %>%
mutate(across(where(is.factor), as.character))
t_df[is.na(t_df)] <- "Column Totals"
t_df
The failed atropos segmentations are included in the following visualizations but were excluded for the site effects tests.
Densities are colored by site, while the black line represents the overall density aggregated across sites. Distribution of the different sites tend to cluster together except for JLF Thalamus.
plot_densities <- function(var){
df %>%
ggplot(aes_string(x=var, color='site')) +
geom_density() +
stat_density(aes_string(x = var), geom = "line", color = "black") +
theme_bw() +
xlab(str_replace_all(var, "_", " "))
}
vars <-df %>%
select(ends_with('FA')) %>%
colnames()
dplots <- lapply(vars, plot_densities) %>%
setNames(vars)
for (name in names(dplots)){
cat("#####", str_replace_all(name, "_", " "), "\n")
print(dplots[[name]])
cat("\n\n")
}
Box plots show difference in medians across sites for most variables.
Additionally, note the severe outliers in the ATROPOS metrics which correspond to the failed segmentations. For FIRST THALAMUS, subject 01002 could be considered an outlier; their FA values are consistent across sites and scans, however, suggesting this variation is meaningful and that this subject should be retained.
plot_avg_tensor_values <- function(tensor){
df %>%
select(!is.numeric, ends_with(tensor)) %>%
pivot_longer(ends_with(tensor), names_to = 'seg', values_to = 'average') %>%
separate(seg, into = c('segmentation', 'roi', 'tensor')) %>%
unite(segmentation, segmentation, roi, sep = " ") %>%
ggplot(aes(x=site, y=average)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(aes(color = subject, shape=scan), alpha=0.5) +
#facet_grid(site~.) +
facet_grid(segmentation ~ .) +
coord_flip() +
ggtitle(sprintf("Mean %s in ROI by site", tensor)) +
theme_bw() +
# theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust=1)) +
scale_x_discrete(expand = c(.00000001, .00000001)) +
xlab("") +
ylab("") +
theme(strip.text.y = element_text(angle = 0))
}
plot_avg_tensor_values('FA')
For mean diffusivity, distributions vary more widely across sites. In particular MIMOSA LESION MD values are quite different for Hopkins data.
vars <-df %>%
select(ends_with('MD')) %>%
colnames()
dplots <- lapply(vars, plot_densities) %>%
setNames(vars)
for (name in names(dplots)){
cat("#####", str_replace_all(name, "_", " "), "\n")
print(dplots[[name]])
cat("\n\n")
}
As before, medians appear to be different across sites.
plot_avg_tensor_values('MD')
Radial Diffusivity also shows some variations across sites (perhaps less than MD). JLF GM RD in NIH shows multiple peaks, and JLF WM RD shows skewed distributions for all sites, pointing to potential non-normality. As before the metric for MIMOSA shows a different distribution for Hopkins data.
vars <-df %>%
select(ends_with('RD')) %>%
colnames()
dplots <- lapply(vars, plot_densities) %>%
setNames(vars)
for (name in names(dplots)){
cat("#####", str_replace_all(name, "_", " "), "\n")
print(dplots[[name]])
cat("\n\n")
}
As before, medians appear to be differet across sites.
plot_avg_tensor_values('RD')
For Axial Diffusivity, Hopkins distributions show marked differences across most ROIs compared to other sites. The difference is quite pronounced for MIMOSA.
vars <-df %>%
select(ends_with('AD')) %>%
colnames()
dplots <- lapply(vars, plot_densities) %>%
setNames(vars)
for (name in names(dplots)){
cat("#####", str_replace_all(name, "_", " "), "\n")
print(dplots[[name]])
cat("\n\n")
}
As before, medians show differences, particularly for Hopkins.
plot_avg_tensor_values('AD')
# replace ATROPOS measures with NA for select images (segmentation failed)
atropos_cols <- df %>%
select(contains('ATROPOS')) %>%
colnames()
df[(df$subject == '04001' & df$site == 'NIH' & df$scan == '01') |
(df$subject == '01003' & df$site == 'NIH' & df$scan == '01') |
(df$subject == '03002' & df$site == 'NIH' & df$scan == '02') |
(df$subject == '04003' & df$site == 'NIH' & df$scan == '01') |
(df$subject == '04001' & df$site == 'BWH' & df$scan == '02') |
(df$subject == '03001' & df$site == 'BWH' & df$scan == '01'), atropos_cols] <- NA
The permutation-based F-test was carried out as previously described. The F-statistic for each of the metrics is plotted below as a black dot, while the distribution of permuted test statistics is shown as a white violin plot. All metrics showed significant site effects.
ratio_stats = readRDS(here::here("results/avg_tensor_by_roi_wide_ratio_stat.rds"))
null_dist = readRDS(here::here("results/avg_tensor_by_roi_wide_null.rds"))
null_dist %>%
pivot_longer(everything()) %>%
ggplot(aes(x=name, y=value)) +
geom_violin(draw_quantiles = c(0.95)) +
geom_point(aes(x=name, y=value),
data = ratio_stats %>%
pivot_longer(everything())) +
coord_flip()
The observed values (“black dots”) are represented here in table form along with their unajusted and adjusted p-values:
P-value of pseudo-F stat:
test_ratio_stat = function(ratio.stats, null.dists){
p.vals = c()
for (col in colnames(ratio.stats)){
ratio.stat = ratio.stats[, col]
null.dist = null.dists[, col]
percentile = ecdf(null.dist)
p.vals = c(p.vals, percentile(ratio.stat))
}
names(p.vals) = colnames(ratio.stats)
return(p.vals)
}
p = 1 - test_ratio_stat(ratio_stats, null_dist)
p_df <- data.frame(p_val = (10000*p + 1)/(1+10000))
perm_df <- cbind(
ratio_stats %>%
pivot_longer(everything(), names_to = 'outcome', values_to = 'observed_statistic'),
p_df %>%
mutate(p_val_fdr = p.adjust(p_val, method = 'fdr'))
)
rownames(perm_df) <- NULL
perm_df
Two mixed models (a base and an extended model) were performed as described above. The p-values for these tests are arranged in increasing order below.
models <- outcomes %>%
lapply(function(x) lmer(reformulate("site + (1|subject) + (1|site)", x), data=df)) %>%
setNames(outcomes)
models_0 <- outcomes %>%
lapply(function(x) lmer(reformulate("(1|subject) + (1|site)", x), data=df)) %>%
setNames(outcomes)
anovas <- purrr::map2(models, models_0, ~ anova(.x, .y)) # deviance test between the two models
# pull p.values from all tests
p_vals <- anovas %>%
purrr::map(~ broom::tidy(.x) %>%
pull(p.value) %>%
na.omit()) %>%
purrr::reduce(c)
anova_test_df <- data.frame(outcome = names(anovas), p_val = p_vals) %>%
arrange(p_val) %>%
mutate(p_val_fdr = p.adjust(p_val, method = 'fdr'))
anova_test_df
Note that, prior to FDR-adjustment, most outcome variables show a significant test indicating the presence of site effects. However, after FDR-adjustment, the vast majority of tests become non-significant.
Repeated-Measures ANOVAs were performed as described above. Below the table shows the F-statistic, effect size (generalized eta squared) and p-values (adjusted and unadjusted).
options(scipen=999)
find_icc <- function(data){
irr::icc(data,
model = "twoway",
type = "agreement",
unit = "single")
}
run_anova <- function (var){
df %>%
filter(scan == '01') %>%
anova_test(dv = var, wid = subject, within = site) %>%
get_anova_table()
}
anovas <- outcomes %>%
lapply(run_anova) %>%
setNames(outcomes)
data.frame(measure = names(anovas),
F_stat = anovas %>%
purrr::map(~ .x$F) %>%
purrr::reduce(c),
ges = anovas %>%
purrr::map(~ .x$ges) %>%
purrr::reduce(c),
p = anovas %>%
purrr::map(~ .x$p) %>%
purrr::reduce(c)) %>%
arrange(p) %>%
mutate(p_val_fdr = p.adjust(p, method = "fdr"))
ICC between scan 1 and scan 2 was calculated for each site. Results are shown below. Most sites show high agreement between scans across metrics.
pair_df_list <- df %>%
pivot_longer(where(is.numeric)) %>%
split(list(.$site, .$name))
# compute names for each pair_df
df_names <- pair_df_list %>%
purrr::map(~ .x %>%
transmute(df_name = paste(site, name, sep="-")) %>%
pull() %>%
unique()
) %>%
purrr::reduce(c)
# finish transformation
pair_df_list <- pair_df_list %>%
purrr::map(~ .x %>%
pivot_wider(names_from = c('name', 'scan')) %>%
select(where(is.numeric))
) %>%
setNames(df_names)
icc_df <- pair_df_list %>%
purrr::map(find_icc) %>%
purrr::imap(~ data.frame(outcome = .y,
icc = .x$value,
lwr = .x$lbound,
upr = .x$ubound,
p_val = .x$p.value)) %>%
dplyr::bind_rows(.id = 'outcome') %>%
separate(outcome, into = c('site', 'outcome'), sep='-')
options(scipen=999)
icc_df